Deciphering regulatory architectures from synthetic single-cell expression patterns

For the vast majority of genes in sequenced genomes, there is limited understanding of how they are regulated. Without such knowledge, it is not possible to perform a quantitative theory-experiment dialogue on how such genes give rise to physiological and evolutionary adaptation. One category of high-throughput experiments used to understand the sequence-phenotype relationship of the transcriptome is massively parallel reporter assays (MPRAs). However, to improve the versatility and scalability of MPRA pipelines, we need a “theory of the experiment” to help us better understand the impact of various biological and experimental parameters on the interpretation of experimental data. These parameters include binding site copy number, where a large number of specific binding sites may titrate away transcription factors, as well as the presence of overlapping binding sites, which may affect analysis of the degree of mutual dependence between mutations in the regulatory region and expression levels. To that end, in this paper we create tens of thousands of synthetic single-cell gene expression outputs using both equilibrium and out-of-equilibrium models. These models make it possible to imitate the summary statistics (information footprints and expression shift matrices) used to characterize the output of MPRAs and from this summary statistic to infer the underlying regulatory architecture. Specifically, we use a more refined implementation of the so-called thermodynamic models in which the binding energies of each sequence variant are derived from energy matrices. Our simulations reveal important effects of the parameters on MPRA data and we demonstrate our ability to optimize MPRA experimental designs with the goal of generating thermodynamic models of the transcriptome with base-pair specificity. Further, this approach makes it possible to carefully examine the mapping between mutations in binding sites and their corresponding expression profiles, a tool useful not only for better designing MPRAs, but also for exploring regulatory evolution.


Introduction
With the widespread emergence of sequencing technology, we have seen an explosion of genomic data in recent years.However, data on transcriptional regulation remains far behind.Even for organisms as widely studied as E. coli, many promoters lack annotations on the transcription factor binding sites that underlie transcriptional regulation.Moreover, existing binding site annotations are largely without experimental validation for functional activity, as a large proportion are determined through DNA-protein interaction assays such as ChIP-Seq [1][2][3][4] or computational prediction [5].This fundamental gap in knowledge poses a major obstacle for us to understand the spatial and temporal control of cellular activity, as well as how cells and organisms respond both physiologically and evolutionarily to environmental signals.
One strategy to understand the regulatory genome is by conducting massively parellel reporter assays (MPRAs), where the regulatory activities of a library of sequences are measured simultaneously via a reporter.The library of sequences may be genomic fragments [6] or sequence variants containing mutations relative to the wild-type regulatory sequence [7].There are two main ways to measure regulatory activities in MPRAs.The first approach uses fluorescence-activated cell sorting to sort cells into bins based on the expression levels of a fluorescent reporter gene [8].Subsequently, deep sequencing is utilized to determine which sequence variant is sorted into which bin.The second approach uses RNA-sequencing (RNA-Seq) to measure the counts of barcodes associated with each sequence variant as a quantitative read-out for expression levels.The two approaches have been used in both prokaryotic [9][10][11][12] and eukaryotic systems [13][14][15] to study diverse genomic elements including promoters and enhancers.In particular, our group has developed Reg-Seq [16], an RNA-Seq-based MPRA where the authors successfully deciphered the regulatory architecture of 100 promoters in E. coli, with the hope now to complete the regulatory annotation of entire bacterial genomes.
Mutations in regulatory elements lead to reduced transcription factor binding, which may result in measurable changes in expression.Therefore, the key strategy to annotate transcription factor binding sites based on MPRA data that we focus on is to identify sites where mutations have a high impact on expression levels.To do this, one approach is to calculate the mutual information between base identity and expression levels at each site.A so-called information footprint can then be generated by plotting the mutual information at each position along the promoter.Positions with high mutual information are identified as putative transcription factor binding sites.
In this paper, we develop a computational pipeline that simulates the RNA-Seq-based MPRA pipeline.Specifically, we make use of equilibrium statistical mechanics to build synthetic datasets that simulate the experimental MPRA data and examine how various parameters affect the output of MPRAs.These parameters include experimental parameters, such as the rate of mutation used to generate the sequence variants, as well as biological parameters such as transcription factor copy number, where a large number of specific binding sites may titrate away transcription factors.This computational pipeline will help us to optimize MPRA experimental design with the goal of accurately annotating transcription factor binding sites in regulatory elements, while revealing the limits of MPRA experiments in elucidating complex regulatory architectures.Additionally, the insights gained from our simulation platform will enable further dialogue between theory and experiments in the field of transcription including efforts to understand how mutations in the evolutionary context give rise to altered gene expression profiles and resulting organismal fitness.
The remainder of this paper is organized as follows.In Sec 1.1, we will introduce our procedure to construct and analyse synthetic datasets for both promoters regulated by a single transcription factor and promoters regulated by a combination of multiple transcription factors.In Sec 1.2 and Sec 1.3, we discuss the choice of parameters related to the construction of the mutant library, including the rate of mutation, mutational biases, and library size.After setting up the computational pipeline, we perturb biological parameters and examine how these perturbations affect our interpretation of information footprints.The parameters that we will explore include the free energy of transcription factor binding (Sec 2.1), the regulatory logic of the promoter (Sec 2.2), the copy number of the transcription factor binding sites (Sec 2.3), and the concentration of the inducers (Sec 2.4).Next, we explore factors that may affect signal-to-noise ratio in the information footprints.These factors include stochastic fluctuations of transcription factor copy number (Sec 3.1), spurious binding events (Sec 3.2), as well as the presence of overlapping binding sites (Sec 3.3).Finally, we discuss the insights generated from our computational pipeline in relation to future efforts to decipher regulatory architectures in diverse genomes.

Results
1 Mapping sequence specificity and expression levels

Computational pipeline for deciphering regulatory architectures from first principles
In MPRA pipelines, the goal is to make the connection between regulatory sequences, transcription factor binding events, and expression levels.In Reg-Seq for example, the authors start with a library of sequence variants for an unannotated promoter, each of which contains a random set of mutations relative to the wild type sequence.Then, RNA-Seq is used to measure the expression levels of a reporter gene directly downstream of each promoter.By calculating the mutual information between mutations and the measured expression levels, the regulatory architecture of the promoter can be inferred.Finally, Bayesian models and thermodynamic models can be built using statistical mechanics to infer the interaction energies between transcription factors and their binding sites in absolute k B T units at a base-by-base resolution [16].
Our computational MPRA pipeline involves similar steps, but instead of starting from experimental measurements of expression levels, we use thermodynamic models to predict expression levels given the sequences of the promoter variants and the corresponding interaction energies, as schematized in Fig 1 .Through this process, we generate synthetic datasets of expression levels that are in the same format as the datasets that we obtain via RNA-Seq.Subsequently, we analyse the synthetic datasets in the same way as we would analyse an experimental dataset.Importantly, we can perturb various experimental and biological parameters of interest within this pipeline and examine how changing these parameters affect our ability to discover unknown transcription factor binding sites through MPRAs.
We first demonstrate our computational pipeline using a promoter with the simple repression regulatory architecture, i.e. the gene is under the regulation of a single repressor.Specifically, we use the promoter sequence of LacZYA.We assume that it is transcribed by the 70 RNAP and only regulated by the LacI repressor, which binds to the O1 operator within the LacZYA promoter.
For a gene with the simple repression regulatory architecture, the probability of RNAP being bound [1] is given by where k B is Boltzmann's constant and T is temperature.As we can see, the parameters that we need are the copy number of RNAP (P ), the copy number of repressor (R), the number of non-binding sites (N NS ), and the binding energies for RNAP and the repressors ( " pd and " rd ), respectively.We begin by assuming that P and R are constant and N NS is the total number of base pairs in the E. coli genome.On the other hand, the values for " pd and " rd depend on the sequence of the promoter variant.
We calculate the binding energies by mapping the sequence of the promoter variant to the energy matrices of the RNAP and the repressor, as shown in Fig 2(A).Specifically, we assume that binding energies are additive.Therefore, given a sequence of length l, the total binding energy " can be written as where " i,bi is the binding energy corresponding to base identity b i at position i according to the energy matrix.
It should be acknowledged that the additive model does not take into account epistasis effects [18,19].It may be beneficial to include higher-order interaction energy terms in future simulations of MPRA pipelines.
After computing the sequence-specific binding energies, we can then substitute the relevant energy terms into Eqn 1 and calculate the probability of RNAP being bound.Here, we use the energy matrices of the RNAP and LacI that were previously experimentally determined using Sort-Seq [20,21], as shown in Fig 2(B).Unless otherwise specified, these energy matrices are used to build all synthetic datasets in the remainder of this paper.
To connect the probability of RNAP being bound to expression levels, we make use of the occupancy hypothesis, which states that the rate of mRNA production is proportional to the probability of RNA polymerase occupancy at the promoter [22].The rate of change in mRNA copy number is given by the January 29, 2024 Given (1) knowledge or assumptions about the regulatory architecture of a promoter, we make use of ( 2) thermodynamic models to construct a states-and-weights diagram, which contains information about all possible states of binding and the associated Boltzmann weights.Here, in the states-and-weights diagram, P is the copy number of RNAP, R is the copy number of the repressor, N NS is the number of non-binding sites, " pd and " rd represent the binding energies of RNAP and the repressors at their specific binding sites relative to the non-specific background, respectively.Using these states-and-weights diagrams as well as the energy matrices, which are normalized to show the change in binding energies for any mutation along the promoter compared to the wild-type sequence, we can (3) predict the expression levels for each of the promoter variants in a mutant library.To recover the regulatory architecture, we (4) calculate the mutual information between the predicted expression levels and mutations at each position along the promoter according to Eqn 6.In particular, there is high mutual information if a mutation leads to a large change in expression and there is low mutual information if a mutation does not lead to a significant change in expression.The mutual information at each position is plotted in an information footprint, where the height of the peaks corresponds to the magnitude of mutual information, and the peaks are colored based on the sign of expression shift, defined in Eqn 9. Given the assumption that the positions with high mutual information are likely to be RNAP and transcription factor binding sites, we (5) recover the regulatory architecture of the promoter.difference between the rates of mRNA production and degradation.In general, there can be multiple transcriptionally active states, each with its own transcription rate.Therefore, we define an average rate of mRNA production, which is given by the sum of each state's production rate, weighted by the probability of the state.Hence, the rate of change of mRNA copy number is given by where for transcriptionally active state i, r i is the rate of transcription, p bound,i is the probability of RNA polymerase occupancy in state i, m is the copy number of mRNAs, and is the rate of mRNA degradation.Therefore, the steady-state level of mRNA is given by Here, for simplicity, we assume that each transcriptionally active state has the same rate of mRNA January 29, 2024 non-specific binding specific binding (B) Position in binding site Mapping binding site sequences to binding energies using energy matrices.(A) Given the assumption that binding energies are additive, we can use an energy matrix to determine how much energy each base along the binding site contributes and compute the total binding energy by taking the sum of the binding energies contributed by each position.The total binding energy can be used to compute the Boltzmann weight for each of the states, which is then used to calculate the probability of RNAP being bound.(B) Experimentally-measured energy matrices of RNAP [20], the LacI repressor [21], and the CRP activator.[8] production, r.Therefore, Using the above expression, we can calculate the expected RNA count for each of the promoter variants in our library.Assuming that r and do not depend on the mRNA sequence, the total probability of RNAP being bound, given by p bound = P p bound,i , is scaled by the same constant to produce the mRNA count of each promoter variant.Therefore, the choice of ↵ does not affect our downstream calculations involving the probability distribution of expression levels.Depending on the sequencing depth, the mRNA count for each promoter variant is typically on the order of 10 to 10 3 [16].Here, we take the geometric mean and set ↵ to be 10 2 to ensure that the mRNA count is on a realistic scale.
Up until this point, we have constructed a synthetic RNA-Seq dataset containing the predicted expression levels of each sequence variant in a mutant library.Next, we recover the regulatory architecture from this synthetic dataset.To do this, we calculate the mutual information between mutations and expression levels.The mutual information at position i is given by where b represents base identity, µ represents expression level, Pr i (b) is the marginal probability distribution of mutations at position i, Pr(µ) is the marginal probability distribution of expression levels across all promoter variants, and Pr i (b, µ) is the joint probability distribution between expression levels and mutations at position i.
In general, b can be any of the four nucleotides, i.e. b 2 {A, C, G, T}.This means that Pr i (b) is obtained by computing the frequency of each base per position.Alternatively, a more coarse grained approach can be taken, where the only distinction is between the wild-type base and mutation, in which case b is defined as b = ( 0, if the base is mutated, 1, if the base is wild type.(7) As shown in S1 Appendix, using the coarse-grained definition of b improves the signal-to-noise ratio of the information footprint as reducing the number of states reduces articifical noise outside of the specific binding sites.Therefore, we use this definition for our subsequent analysis.
To represent expression levels as a probability distribution, we group sequences in each range of expression levels into discrete bins and compute the probabilities that a given promoter variant is found in each bin.As shown in S1 Appendix, we found that increasing the number of bins leads to a lower signal-to-noise ratio in the information footprints because the additional bins contribute to artificial noise.
Therefore, we choose to use only two bins with the mean expression level as the threshold between them.
This means that µ can take the values of µ = ( 0, if expression is lower than mean expression 1, if expression is higher than mean expression.
In S2 Appendix, we derive the information footprint for a constitutive promoter analytically and demonstrate that in the absence of noise, mutual information is expected to be 0 outside of the specific binding sites and non-zero at a specific binding site.
To decipher the regulatory architecture of a promoter, another important piece of information is the direction in which a mutation changes expression.This can be determined by calculating the expression shift [23], which measures the change in expression when there is a mutation at a given position.Suppose there are n promoter variants in our library, then the expression shift s l at position l is given by Here, c i represents the RNA count of the i-th promoter variant.⇠ i,l = 0 if the base at position l in the i-th promoter variant is wild type and ⇠ i,l = 1 if the base is mutated.If the expression shift is positive, it indicates that mutations lead to an increase in expression and the site is likely to be bound by a repressor.On the other hand, a negative expression shift indicates that mutations lead to a decrease in expression, and therefore the site is likely to be bound by RNAP or an activator.
By calculating the mutual information and expression shift at each base position along the promoter, we can plot an information footprint for a promoter with the simple repression genetic architecture, as shown in Fig 3(B).There are two peaks with negative expression shifts near the -10 and -35 positions, which correspond to the canonical RNAP binding sites.There is another peak immediately downstream from the transcription start site with a positive expression shift, which corresponds to the binding site of the LacI repressor.Taken together, we have demonstrated that by calculating mutual information, we are able to recover binding sites from our synthetic dataset on expression levels.
Using the same procedure as described above, we can also produce synthetic datasets for other classes of regulatory architectures.In Fig 3, we demonstrate that we can recover the expected binding sites based on synthetic datasets for six common types of regulatory architectures [16].The states-and-weights diagrams and p bound expressions used to produce these synthetic datasets are shown in S3 Appendix.

Changing mutation rates and adding mutational biases
One key parameter in the MPRA pipeline is the level of mutation for each sequence variant in the library.
Here, we again consider a gene with the simple repression genetic architecture as a case study and we examine how varying mutation rates and mutational biases changes the signals in the information footprints.We quantify the level of signal, S, by calculating the average mutual information at each of the binding sites.This is given by the formula where B represents the set of bases within a given binding site, I i represents mutual information at base position i, and l is the length of B, i.e. the number of bases in the binding site.
January 29, 2024 6 As shown in Fig 4(A) and 4(B), when there is a higher rate of mutation, the average mutual information at the RNAP binding site increases relative to the average mutual information at the repressor binding site.To explain this effect, we consider , the ratio between the Boltzmann weights of the repressor and RNAP where m r and m p are the number of mutations at the repressor and RNAP binding sites, and " rd and " pd are the change in binding energies due to each mutation at the repressor and RNAP binding sites.To express  as a function of the mutation rate ✓, we can rewrite m r and m p as a product of ✓ and the lengths of repressor and RNAP binding sites l r and l p , We assume that " rd and " pd are equal to the average effect of mutations per base pair within each binding site, which can be calculated using the formula January 29, 2024 7 where " i,b is the energy contribution from position i when the base identity is b.As we are using energy matrices where the energies corresponding to the wild-type base identities are set to 0, we only need to compute the sum of the energy terms for the mutant bases b 6 = b i 2 ⇤ at each position.Since there are three possible mutant bases at each site, it follows that to find the average effect of mutations, we divide the sum of the energy matrix by 3 times the length of the binding site l.
By applying this formula to the energy matrices in Fig 2(B), we see that " rd ⇡ 2.24 k B T and , where " pd is averaged over the -38 to -30 and -15 to -5 binding sites.Moreover, l r = l p ⇡ 20 base pairs.Therefore, Since the above value is positive,  decreases with increasing mutation rate p, making the repressor bound state less likely compared to the RNAP bound state.With the repressor bound state becoming less likely, the signal in the repressor binding site goes down, since mutations changing the binding energy of the repressor change the transcription rate less significantly.As shown in S4 Appendix, when we reduce the effect of mutations on the binding energy of the repressor, we recover the signal at the repressor binding site.
Conversely, the average mutual information at the repressor binding site increases when the rate of mutation is decreased.This is because when there are very few mutations, the energy E will be less than 0 and therefore  will be greater than 1.As a result, the repressor will be preferentially bound, which blocks RNAP binding and leads to a low signal at the RNAP binding site.We can recover the signal at the RNAP binding site by increasing the binding energy between RNAP and the wild type promoter, as shown in S4 Appendix.
To find the optimal rate of mutation, we need to satisfy the condition which puts repressor and RNAP binding on an equal footing.Plugging in the values of R, P , and the energy terms and solving for ✓, we get that ✓ ⇡ 0.10.This shows that an intermediate mutation rate is optimal for maintaining high signals at all binding sites.Moreover, the information footprints for all the architectures in Fig 3 are built from synthetic datasets with a mutation rate of 10% and each footprint reflects the expected regulatory architecture.Therefore, for the remaining analysis shown in this work, we fix the mutation rate at 10%.This is similar to the mutation rate typically used in MPRAs such as Sort-Seq [8] and Reg-Seq [16].
In addition to mutation rate, another important variation in the design of the mutant library is the presence of mutational biases.For example, some mutagenesis techniques, including CRISPR-Cas9, often carry mutational biases whereby mutations within the family of purines and the family of pyrimidines have a higher efficiency compared to mutations between purines and pyrimidines [24].We build mutant libraries that incorporate two different mutational spectrums.In the first case, we allow only swaps between A and G and between C and T. For this library, we observe that the signals at both the RNAP binding site and the repressor binding site are well preserved, as shown in the left panel of

Noise as a function of library size
Another parameter that is important for library design is the total number of sequence variants in the mutant library.We build synthetic datasets with varying library sizes and computed the information footprints.To quantify the quality of signal in information footprints, we calculate signal-to-noise ratio, , according to the formula Here, I i represents the mutual information at position i, B is the set of bases within each binding site, NB is the set of bases outside the binding sites, l B is the length of the specific binding site, and l NB is the total As a result, when the library is small, there is an increased likelihood that a mutation outside of specific binding sites and a mutation at a specific binding site become correlated by chance, leading to artificial signal at the non-binding sites.
To demonstrate the hitch-hiking effect analytically, we consider a hypothetical promoter that is constitutively transcribed and only two base pairs long, as illustrated in Fig 6(A).Without loss of generality, we assume that there are only two letters in the nucleotide alphabet, X and Y. Therefore, a complete and unbiased library contains four sequences: XX, YX, XY, and YY.We designate that " X < " Y , i.e. the RNAP is strongly bound at the binding site when the base identity is X and weakly bound when the base identity is Y.We also assume that there is active transcription only when RNAP is bound to the second site.Under these assumptions, there are high expression levels when the promoter sequence is XX or YX and low expression levels when the promoter sequence is XY and YY.
We first consider a mutant library with full diversity and no bias, i.e. the four possible sequences, XX, January 29, 2024 Hitch-hiking effect in the hypothetical minimal promoter.(A) Set-up of the hypothetical minimal promoter.The specific binding sites and the non-binding sites of the minimal promoter are each 1 base-pair long.There are two possible bases at each binding site, X and Y. Strong binding occurs when the base is X, whereas weak binding occurs when the base is Y. (B) Effect of library size on the information footprint of the minimal promoter.A full mutant library consists of all four possible sequences and leads to a footprint with no signal outside of the specific binding sites.On the other hand, a reduced mutant library with only two sequences creates noise outside of the specific binding sites.In this case, the noise at the non-binding site has the same magnitude as the signal at the specific binding site.
YX, XY, and YY, are each present in the library exactly once.The marginal probability distribution for expression levels is The marginal probability distributions of base identity at the two sites are The joint probability distribution at the first site is On the other hand, the joint probability distribution at the second site is We can calculate the mutual information at each site according to Eqn 6, Therefore, when the library has the maximum size, there is perfect signal at the specific binding site and no signal outside of the specific binding site, as shown in Fig 6(B).
On the other hand, consider a reduced library that only consists of XX and YY.According to the assumptions stated above, XX has high expression and YY has low expression.In this case, there is an apparent correlation between the base identity at the non-binding site and expression levels, where a base identity of X at the non-binding site appears to lead to high expression levels and a base identity of Y at the non-binding site appears to lead to low expression levels.To demonstrate this analytically, we again write down the relevant probability distributions required for calculating mutual information.The marginal probability distributions for expression levels and base identity are the same as the case where we have a full library.However, the joint probability distributions at both of the two sites become This means that for both the non-binding site and the specific binding site, the mutual information is As shown in Fig 6(B), this creates an artificial signal, or noise, outside of the specific binding sites that cannot be distinguished from the signal at the specific binding site.
For the remaining analyses shown in this work, we use a library size of 5,000 in order to minimize noise from hitch-hiking effects.We choose not to use a larger library because it would significantly increase the computational cost during parameter searches.Moreover, we would like to use a library size that is experimentally feasible.In Reg-Seq, the average library size is 1,500 [16].A larger library would make MPRAs cost prohibitive as a high-throughput method.
2 Perturbing biological parameters in the computational pipeline

Tuning the free energy of transcription factor binding
So far, we have demonstrated that we can build synthetic datasets for the most common regulatory architectures and we have chosen the appropriate mutation rate and library size to construct mutant January 29, 2024 11 libraries.Next, we proceed to perturb parameters that affect the probability of RNAP being bound and observe the effects of these perturbations.These analyses will elucidate the physiological conditions required for obtaining clear signals from transcription factor binding events and delineate the limits of MPRA procedures in identifying unannotated transcription factor binding sites.
We again begin by considering the promoter with the simple repression motif, for which the probability of RNAP being bound is given by Eqn 1.It is known that in E. coli grown in minimal media, the copy number of RNAP is P ⇡ 10 3 [25,26] and the copy number of the repressor is R ⇡ 10 [27].The binding energy of RNAP is " pd ⇡ 5 k B T [20] and the binding energy of the repressor is " rd ⇡ 15 k B T [28].Moreover, assuming that the number of non-binding sites is equal to the size of the E. coli genome, we have that Furthermore, we define the free energy of RNAP binding as and the free energy of repressor binding as Both expressions are written according to the definition of Gibbs free energy, where the first terms correspond to enthalpy and the second terms correspond to entropy.Using these definitions, we can rewrite p bound as In this section, we specifically examine the changes in the information footprints when we tune F R .As shown in Fig 7(A) and 7(C), if we increase F R by reducing the magnitude of " rd or reducing the copy number of the repressor, we lose the signal at the repressor binding site.For example, compared to the O1 operator, the LacI repressor has weak binding energy at the O3 operator, where " rd ⇡ 10 k B T [28]. Therefore and In these cases, e F R ⌧ 1 and therefore e F R can be neglected from the denominator and the probability of RNAP being bound can be simplified to (A) mutation increases expression mutation decreases expression 2) 1 2 (B) The strength of the signal at binding sites depends on the free energy of repressor binding.(A) Increasing the binding energy of the repressor leads to an increase in average mutual information at the RNAP binding site and a decrease in average mutual information at the repressor binding site." pd is fixed at -5 k B T , RNAP copy number is fixed at 1000, and repressor copy number is fixed at 10. (B) Representative information footprints where " rd is set to 20 k B T and 10 k B T .(C) Increasing the copy number of the repressor leads to a decrease in average mutual information at the RNAP binding site and an increase in average mutual information at the repressor binding site." pd is fixed at -5 k B T and " rd is fixed at -15 k B T .RNAP copy number is fixed at 1000.(D) Representative information footprints where repressor copy numbers are set to 1 and 500.This implies that mutations at the repressor binding sites will not have a large effect on p bound and the mutual information at the repressor binding site will be minimal.
On the other hand, if we decrease the free energy of binding either by increasing the magnitude of " rd or increasing the copy number of the repressor, it leads to a stronger signal at the repressor binding site while significantly reducing the signal at the RNAP binding site, as we can see in Fig 7(A) and 7(C).For example, when " rd = 20k B T , we have that and therefore Here, the Boltzmann weight of the repressor has been increased a hundred fold compared to Eqn 26.Due to the strong binding of the repressor, mutations at the RNAP binding site do not change expression on measurable levels and therefore the signal is low at the RNAP binding site.
January 29, 2024 13 In particular, we see in Fig 7(A) that when the repressor energy is increased beyond 11 k B T , the average mutual information at the RNAP binding site saturates and the average mutual information at the repressor binding site remains close to 0. To explain this effect, we again take a look at the ratio between the Boltzmann weights of the repressor and RNAP, the expression for which is stated in Eqn 11.Here, we fix the copy number of the repressors and RNAP, the wild-type binding energy of RNAP, the number of mutations, and the effect of mutations.Therefore, = 10 1000 • e " rd 5 2⇥2.24+2⇥0.36 We assume that  needs to be at least 0.1 for there to be an observable signal at the repressor binding site.Solving for " rd using the above equation, we have that " rd ⇡ 11 k B T .This matches with our observation that the signal stabilizes when " rd > 11 k B T .Taken together, these result invite us to rethink our interpretation of MPRA data, as the lack of signal may not necessarily indicate the absence of binding site, but it may also be a result of strong binding or low transcription factor copy number.

Changing the regulatory logic of the promoter
In the previous section, we examined the changes in information footprints when we tune the copy number of the repressors under the simple repression regulatory architecture.The effect of transcription factor copy numbers on the information footprints is more complex when a promoter is regulated by multiple transcription factors.In particular, the changes in the information footprints depend on the regulatory logic of the promoter.To see this, we consider a promoter that is regulated by two repressors.For a double-repression promoter, there are many possible regulatory logics; two of the most common ones are an AND logic gate and an OR logic gate.As shown in Fig 8(A), if the two repressors operate under AND logic, both repressors are required to be bound for repression to occur.This may happen if each of the two repressors bind weakly at their respective binding sites but bind cooperatively with each other.On the other hand, if the two repressors operate under OR logic, then only one of the repressors is needed for repression.We generate synthetic datasets for an AND-logic and an OR-logic double-repression promoter that are regulated by repressors R 1 and R 2 .As shown in Fig 8 (B) and 8(C), under AND logic, there is no signal at either of the repressor binding sites when R 1 is set to 0. This matches our expectation because AND logic dictates that when one of the two repressors is absent, the second repressor is not able to reduce the level of transcription by itself.We recover the signal at both of the repressor binding sites even when there is only one copy of R 1 .Interestingly, when R 1 > R 2 , the signal at the R 1 binding site is greater than the signal at the R 2 binding site.This may be because the higher copy number of R 2 compensates for the effects of mutations and therefore expression levels are affected to a greater extent by mutations at the R 1 binding site than mutations at the R 2 binding site.
In comparison, since the two repressors act independently when they are under OR logic, the signal at the R 2 binding site is preserved even when R 1 = 0.Moreover, the state where R 1 represses transcription competes with the state where R 2 represses transcription.As a result, when R 1 is increased, the signal at the R 1 binding site is increased whereas the signal at the R 2 binding site decreases.
We observe similar changes in the information footprint for a promoter that is bound by two activators, as illustrated in S5 Appendix.These results are informative in the context of transcription factor deletion, which is a key approach for identifying and verifying which transcription factor binds to the putative binding sites discovered in MPRA pipelines [16].The final copy number of the transcription factor depends on which experimental method is chosen to perform the deletion.If the gene coding for the transcription factor is knocked out, no transcription factor will be expressed and the transcription factor copy number will be 0. Therefore, by comparing the footprints from the wild-type strain and the transcription factor deletion strain, we can locate the site where the signal disappears and deduce which transcription factor is bound at that site.On the other hand, if knock-down methods such as RNAi are used, some leaky expression may take place and the transcription factor copy number may be low but non-zero.In this case, there may not be appreciable differences in the footprints from the wild-type strain and the deletion strain.This would be an January 29, 2024

Competition between transcription factor binding sites
Thus far, we have assumed that each transcription factor only has one specific binding site in the genome.
However, many transcription factors bind to multiple promoters to regulate the transcription of different genes.For example, cyclic-AMP receptor protein (CRP), one of the most important activators in E. coli, regulates 330 transcription units [29].Therefore, it is important to understand how the relationship between sequence and binding energy changes when the copy number of the transcription factor binding site is changed.
Binding site copy number is also highly relevant in the context of the experimental MPRA pipeline.
When E. coli is the target organism, there are two main ways of delivering mutant sequences into the cell.
As illustrated in Fig 9(A), the first is by directly replacing the wild type promoter with the mutant promoter using genome integration methods such as ORBIT [30,31].In this case, we preserve the original copy number of the binding sites.The second method is to transform the bacterial cells with plasmids carrying the promoter variant.If this approach is used, the number of transcription factor binding sites will increase by the copy number of the plasmids.We would like to understand precisely how the signal in the resulting information footprint differs between a genome integrated system and a plasmid system.
To build a synthetic dataset that involves more than one transcription factor binding site, we once again begin by building a thermodynamic model to describe the different binding events.However, in the canonical thermodynamic model that we utilized earlier, introducing multiple transcription factor binding sites would lead to a combinatorial explosion in the number of possible states.To circumvent this issue, we introduce an alternative approach based on the concept of chemical potential.Here, chemical potential corresponds to the free energy required to take an RNAP or a transcription factor out of the cellular reservoir.As shown in Changing the copy number of transcription factor binding sites.(A) There are two ways of delivering sequence variants into a cell.If the promoter variant is integrated into the genome, the original copy number of the promoter is preserved.On the other hand, if the cells are transformed with plasmids containing the promoter variant, binding site copy number will increase.When the copy number of the binding site is high, the additional binding sites titrate away the repressors and the gene will be expressed at high levels despite the presence of repressors.(B) States-and-weights models for a constitutive promoter in an isolated system and in contact with a cellular reservoir of RNAPs.To build a thermodynamic model for an isolated system, we assume that there are no free-floating RNAPs and we require that the number of bound RNAPs is equal to the total number of RNAPs in the system.On the other hand, for a system that is in contact with a reservoir, we only need to ensure that the average number of RNAPs bound matches the total number of RNAPs.(C) Average mutual information at the repressor binding site decreases when the number of the repressor binding site is increased.Repressor copy number is set to 10 for all data points.(D) Representative information footprints for cases where there is only 1 repressor binding site and when there are 50 repressor binding sites.
Using the method of chemical potential, we construct synthetic datasets with different repressor binding site copy numbers.As shown in Fig 9(C), as the copy number of the repressor binding site is increased, the signal at the repressor binding site decreases rapidly and eventually stabilizes at a near-zero value.In particular, as shown in Fig 9(D), in a genome integrated system where there is only one copy of the repressor January 29, 2024 16 binding site, there is clear signal at the repressor binding site.On the other hand, in a plasmid system where the copy number of the binding site is greater than the copy number of the repressor, the signal for the repressor disappears.Intuitively, this is because the additional binding sites titrate away the repressors, which reduces the effective number of repressors in the system.As a result, the expression of the reporter gene no longer reflects transcriptional regulation by the repressor.This reduces the effect of mutations at the repressor binding site on expression levels, which leads to low mutual information between mutations and base identities at the repressor binding site.
In wild type E. coli, the median ratio of transcription factor copy number and binding site copy number is around 10 [32], and therefore the titation effects are unlikely to diminish the signals in information footprints when the sequence variant is integrated into the genome.On the other hand, if a plasmid system is used, it is beneficial to make use of a low copy number plasmid.Although we have no knowledge of which transcription factor is potentially regulating the gene of interest and therefore we do not know a priori the copy number of the transcription factor, using a low copy number plasmid has a greater chance of ensuring that the copy number of the transcription factor binding sites is no greater than the copy number of the putative transcription factor.

Changing the concentration of the inducer
So far, in the regulatory architectures that involve repressor binding, we have only considered repressors in the active state, whereas in reality the activity of the repressors can be regulated through inducer binding.Specifically, according to the Monod-Wyman-Changeux (MWC) model [33], the active and inactive states of a repressor exist in thermal equilibrium and inducer binding may shift the equilibrium in either direction.If inducer binding shifts the allosteric equilibrium of the repressor from the active state towards the inactive state, the repressor will bind more weakly to the promoter.This will increase the probability of RNAP being bound and therefore lead to higher expression.In other words, increasing inducer concentration has similar effects to knocking out the repressor from the genome.For example, when lactose is present in the absence of glucose, lactose is converted to allolactose, which acts as an inducer for the Lac repressor and leads to increased expression of genes in the LacZYA operon.Conversely, some inducer binding events may also shift the equilibrium of a repressor from the inactive state towards the active state.One example is the Trp repressor, which is activated upon tryptophan binding and represses gene expression.Here, we use the example of the lacZYA operon and demonstrate how signals in the information footprint depend on the concentration of the allolactose inducer in the system.
As shown in Fig 10(A), to include an inducible repressor in our thermodynamic model, we add an additional state in the states-and-weights diagram that accounts for binding between the inactivated repressor and the promoter.This additional state is a weak binding state where the repressor is more likely to dissociate from the binding site.In many cases, transcription factors have multiple inducer binding sites.Here, we choose a typical model where the repressor has two inducer binding sites.Based on the new states-and-weights diagram, the probability of RNAP being bound can be rewritten according to the following expression [34] where P is the number of RNAPs, R A is the number of active repressors, R I is the number of inactive repressors, and " pd , " A rd , " I rd correspond to the energy differences between specific and non-specific binding of the RNAP, the active repressor, and the inactive repressor respectively.
In order to calculate the probability of RNAP being bound, we need to determine the proportion of R A and R I with respect to the total number of repressors.To do this, we calculate p active (c), the probability that the repressor exists in the active conformation as a function of the concentration of the inducer, c.To calculate p active (c), we model the different states of the repressor using another states-and-weights diagram, as illustrated in Fig 10(B).The probability that the repressor is in the active state is where K A is the dissociation constant between the inducer and the active repressor, K I is the dissociation constant between the inducer and the inactive repressor, and " AI is the structural energy difference between the active repressor and the inactive repressor.This allows us to represent the number of active and inactive repressors as R A = p active R and R I = (1 p active )R.Therefore, our expression for p bound can be modified to We built synthetic datasets for a promoter with the simple repression regulatory architecture with an inducible repressor.As shown in Fig 10(C) and 10(D), when the concentration of the inducer is increased from 10 2 K A to 1 K A , the average signal at the repressor binding site decreases.Interestingly, the average signal is not reduced further when the concentration is increased beyond the value of K A .As shown in S6 Appendix, similar results are observed in the case of a simple activation promoter with an inducible activator.These results show that the presence or absence of inducers can determine whether we will obtain a signal at the transcription factor binding site.This underlies the importance of performing experimental MPRAs under different growth conditions to ensure that we can identify binding sites that are bound by transcription factors that are induced by specific cellular conditions.These efforts may fill in the gap in knowledge on the role of allostery in transcription, which so far has been lacking attention from studies in the field of gene regulation [35].January 29, 2024 3 Identifying transcription factor binding sites from information footprints

Noise due to stochastic fluctuations of RNAP and transcription factor copy number
When data from MPRAs is presented in the form of information footprints, one way to annotate transcription factor binding sites is to identify regions where the signal is clearly higher than background noise.Therefore, to be able to precisely and confidently identify RNAP and transcription factor binding sites from information footprints, the footprint is required to have a sufficiently high signal-to-noise ratio.
However, this may not always be the case.For example, the footprint shown in Fig 11 is obtained for the mar operon by Ireland et al. [16]; while the signal at the RNAP binding sites and the -20 MarR binding sites are clearly above the background noise, the signals at the Fis, MarA, and +10 MarR binding sites may easily be mistaken for noise.

mutation increases expression mutation decreases expression
Fig 11 .Annotating transcription factor binding sites by identifying sites with high signal.The footprint of the marR operon, produced by Ireland et al. [16] The binding sites are annotated based on known RNAP and transcription factor binding sites; the signal at some of the binding sites, such as the Fis and MarA binding sites, are not distinguishable from background noise.
In Sec 1.3, we examined how the size of the mutant library may affect the level of noise in information footprints.Here, we continue to examine other factors that may affect signal-to-noise ratio.We first simulated possible sources of experimental noise, including PCR amplification bias and random sampling effects during RNA-Seq library preparation and RNA-Seq itself.However, as shown in S7 Appendix, these experimental processes do not lead to significant levels of noise outside of the specific binding sites.
Another potential source of noise is from the biological noise that contributes to stochastic fluctuations in expression levels.These sources of biological noise can be broadly categorized into intrinsic noise and extrinsic noise [36][37][38].Intrinsic noise arises from the inherent stochasticity in the process of transcription, such as changes in the rate of production or degradation of mRNA.On the other hand, extrinsic noise arises from cell-to-cell variation in the copy number of transcription machineries such as the RNAPs and transcription factors.It has been shown that extrinsic noise occurs on a longer timescale and has a greater effect on phenotypes than intrinsic noise [38].Here, we investigate whether extrinsic noise has an effect on the information footprints.
We build synthetic datasets for a promoter with the simple repression genetic architecture using the same procedure as before.However, we no longer specify the copy number of RNAPs and repressors as a constant integer.Instead, as described in S8 Appendix, we draw a random number for the number of RNAPs and repressors from a log-normal distribution, which is the distribution that the abundance of biomolecules often follow [39].As shown in Fig 12(A) and 12(B), we observe that the signal-to-noise ratio does decrease when the extrinsic noise is higher.However, we can still distinguish between signal and noise even when we specify a large variance.This suggests that information footprints as a read-out are robust to extrinsic noise.This phenomenon may be explained by the fact that the changes in binding energies due to mutations in the promoter sequence have a much greater contribution to the probability of RNAP being bound than changes in the copy number of transcription factors.Assuming that RNAP binds weakly to the promoter, the expression for p bound in Eqn 1 can be simplified to Based on the experimentally measured energy matrix for LacI [20], the average increase in " rd due to one mutation is approximately 2 k B T .The LacI binding site is 21 base pairs long.Therefore, with a 10% mutation rate, there are on average 2 mutations within the LacI binding site, and the total change in " rd would be approximately 4 k B T .This leads to a e " rd = 0.01 fold change in the magnitude of R NNS e " rd .This means that the copy number of the repressor would have to change by a factor of 100 to overcome the effect of mutations, and this is not possible through fluctuations due to extrinsic noise.Therefore, extrinsic noise by itself will not lead to a sufficiently large change in expression levels to affect the signals in the information footprint.

Spurious binding site along the promoter
RNAPs with 70 typically bind to the TATA box, which is a motif with the consensus sequence TATAAT located at the -10 site along the promoter.However, not all nucleotides within the TATAAT motif are required for binding.In particular, each of the bases from the third to the fifth positions (TAA) is only found in around 50% of RNAP binding sites [40].This makes it likely that there exist TATAAT-like motifs away from the -10 site.If RNAPs bind at these spurious sites, it could initiate transcription from a different transcription start site and produce mRNAs.We hypothesize that these spurious binding events outside of the canonical RNAP binding site may lead to noise in information footprints.
To investigate the effect of spurious binding on information footprints, we build a thermodynamic model that includes RNAP binding at every possible position along the promoter as a unique state, as shown in Fig 13(A).The weight of each of these states is calculated by mapping the energy matrix to the corresponding sequences at each position.As shown in Fig 13(B), we find that spurious binding decreases the signal within the canonical -35 and -10 binding sites while increasing noise outside of these specific binding sites.However, as is with the case of extrinsic noise, this source of noise is not at a sufficiently high level to interfere with our ability to annotate binding sites.This implies that reducing the hitch-hiking effects described in Sec 1.3 should be the primary focus when a high signal-to-noise ratio is desired.

Overlapping binding sites
Other than a low signal-to-noise ratio, another factor that may contribute to the challenge of annotating binding sites is when binding sites overlap.This is especially common with RNAP and repressor binding sites, since a common mechanism by which repressors act to reduce expression is by binding to the RNAP binding site and thereby sterically blocking RNAP binding.For example, in Fig 11,   States-and-weights diagram of a simple repression promoter where spurious RNAP binding is allowed.For each of the RNAP spurious binding events, the binding energy, " pd,i , is computed by mapping the RNAP energy matrix to the spurious binding site sequence.The index i correspond to the position of the first base pair to which RNAP binds along the promoter.0 is at the start of the promoter sequence; k is at the canonical RNAP binding site; n = 160 l p is index of the most downstream binding site where the promoter is assumed to be 160 base-pair long and l p is the length of the RNAP binding site.(B) Information footprints of a promoter under the simple repression regulatory architecture without (top) and with (bottom) spurious RNAP binding.
and repressor binding sites overlap, and we examine how the presence of these overlapping sites affect the signal in information footprints.
Assuming perfect binding sites, a mutation in the RNAP binding site will decrease expression and a mutation in the repressor binding site will increase expression.As shown in Fig 14(A), when the RNAP and repressor binding sites do not overlap, we see well-separated signals in the footprint.However, if the binding sites overlap, we expect either the signals from the two binding events will cancel out or one signal will dominate the other.
We build two synthetic datasets with overlapping binding sites.In the first synthetic dataset, the

Discussion
In this paper, we explore the landscape of sequence-energy-phenotype mapping by utilizing a computational pipeline that simulates MPRA pipelines such as Reg-Seq.More generally, our computational pipeline makes it possible to use statistical mechanical models of gene expression to systematically explore the connection between mutations and level of gene expression.Using this pipeline, we have examined the effects of perturbing various experimental and biological parameters.These perturbations occur at multiple stages of the pipeline.Some parameters pertain to the initial library design, such as library size, mutation rate, and presence of mutational bias.Other parameters are built into the model itself, such as the copy number of the promoter and the transcription factors, which are parameters that may vary biologically and may also be affected by the design of experimental procedures.
We have demonstrated that our computational pipeline has high flexibility and can easily be adapted to examine the effects of other perturbations not included in this paper.Furthermore, the computational nature of the pipeline allows full parameter searches to be done precisely and efficiently.For example, it would be both time-consuming and cost-prohibitive to experimentally determine the optimal library size and mutation rate as it would involve performing a large array of experimental test runs.On the other hand, using our computational pipeline, we can efficiently build a series of synthetic datasets with different mutant libraries and determine the strategy for library design that is optimal for deciphering regulatory architectures.
Apart from informing the choice of experimental parameters, the pipeline also helps to anticipate challenges involved in parsing information footprints.For example, in Sec 3.3, we predict how the signal in information footprints would be affected when there are overlapping binding sites.One potential usage of this pipeline is for building synthetic datasets that involve features that could lead to information footprints that are hard to parse.Since the synthetic datasets are built with prior knowledge of the underlying regulatory architectures, these datasets can be used to develop and improve algorithms for deciphering these architectures.This will increase confidence in the results when the same algorithms are used to analyze experimental datasets and determine the location of binding sites.Moreover, this will pave the way for automatically annotating binding sites for any given information footprint given MPRA data.
One limitation of our computational MPRA pipeline is that it neglects the interaction between different genes in regulatory networks, which affects expression levels and may alter the expected signal in information footprints.A future direction, therefore, involves building synthetic datasets of genetic networks.This would require an additional step where we modify the expression levels of each gene based on its dependency on other genes.This would not only improve the reliability of our prediction of expression levels, but these multi-gene synthetic datasets may also be used to test approaches for deciphering the architecture of regulatory networks.January 29, 2024 22 Furthermore, we acknowledge that our current computational pipeline only considers transcription initiation factors, while other types of transcription factors, such as elongation factors and termination factors, are also important for determining gene expression.It is challenging to include these factors into our model as it would involve additional kinetics terms that would have to be worked out, though we are extremely interested in developing these approaches as well in our future work.While this work does not directly address the challenge in understanding the role of transcription elongation factors and termination factors, we believe that by achieving a full understanding of transcription initiation factor binding through the efforts of both the computational and experimental MPRA pipelines, it will help streamline strategies needed to decipher the roles of other types of transcription factors.
In summary, we have developed a theoretical framework for a widely used category of experiments in the field of transcriptional regulation.Our simulation platform establishes a systematic way of testing how well high-throughput methods such as the MPRAs can be used to recover the ground truth of how the expression of a gene is transcriptionally regulated.This demonstrates the importance of developing theories of experiments in general, and we believe there is much untapped potential in extending similar types of theories to other areas of experimental work as well.Finally, we anticipate that this approach will also be very useful in performing systematic studies on the relation between mutations in regulatory binding sites and the corresponding level of gene expression in a way that will shed light on both physiological and evolutionary adaptation.

Fig 1 .
Fig 1.A computational pipeline for deciphering regulatory architectures from first principles.Given(1) knowledge or assumptions about the regulatory architecture of a promoter, we make use of (2) thermodynamic models to construct a states-and-weights diagram, which contains information about all possible states of binding and the associated Boltzmann weights.Here, in the states-and-weights diagram, P is the copy number of RNAP, R is the copy number of the repressor, N NS is the number of non-binding sites, " pd and " rd represent the binding energies of RNAP and the repressors at their specific binding sites relative to the non-specific background, respectively.Using these states-and-weights diagrams as well as the energy matrices, which are normalized to show the change in binding energies for any mutation along the promoter compared to the wild-type sequence, we can (3) predict the expression levels for each of the promoter variants in a mutant library.To recover the regulatory architecture, we (4) calculate the mutual information between the predicted expression levels and mutations at each position along the promoter according to Eqn 6.In particular, there is high mutual information if a mutation leads to a large change in expression and there is low mutual information if a mutation does not lead to a significant change in expression.The mutual information at each position is plotted in an information footprint, where the height of the peaks corresponds to the magnitude of mutual information, and the peaks are colored based on the sign of expression shift, defined in Eqn 9. Given the assumption that the positions with high mutual information are likely to be RNAP and transcription factor binding sites, we (5) recover the regulatory architecture of the promoter.

Fig 2 .
Fig 2.Mapping binding site sequences to binding energies using energy matrices.(A) Given the assumption that binding energies are additive, we can use an energy matrix to determine how much energy each base along the binding site contributes and compute the total binding energy by taking the sum of the binding energies contributed by each position.The total binding energy can be used to compute the Boltzmann weight for each of the states, which is then used to calculate the probability of RNAP being bound.(B) Experimentally-measured energy matrices of RNAP[20], the LacI repressor[21], and the CRP activator.[8]

Fig 3 .
Fig 3. Building information footprints based on synthetic datasets of different regulatory architectures.We describe each of the regulatory architectures using the notation (A,R), where A refers to the number of activator binding sites and R refers to the number of repressor binding sites.The corresponding information footprints built from synthetic datasets are shown on the right.The architectures shown in panels (A)-(F) are a constitutive promoter, simple repression, simple activation, repression-activation, double repression, and double activation, respectively.For panels (A)-(D), we use energy matrices of RNAP, LacI, and CRP previously measured using Sort-Seq, shown in Fig 2(B).For panels (E) and (F), we continue to use the experimentally measured energy matrix for RNAP.The energy matrices for the repressors and the activators are constructed by hand, where the interaction energies at the wild type bases are set to 0 k B T and the interaction energies at the mutant bases are set to 1 k B T .
Fig 4(C).In the second case, we only allow mutations from G to A and from C to T without allowing the reverse mutations.As shown in the right panel of Fig 4(C), due to only two bases being allowed to mutate, only a few, possibly low-effect mutations are observed, making small regions such as the -10 and -35 sites hard to detect.These results show that information footprints are robust to mutational biases provided that most sites are allowed to mutate.

1 ) 2 ) 3 )Fig 4 .
Fig 4. Changing mutation rate and adding mutational biases.(A) Changes in the average mutual information at the RNAP and at the repressor binding sites when the mutation rate of the mutant library is increased.Average mutual information is calculated according to Eqn 10.Each data point is the mean of average mutual information across 20 synthetic datasets with the corresponding mutation rate.(B) Representative information footprints built from synthetic datasets with mutation rates of 0.04, 0.1, and 0.2.(C) Information footprints built from synthetic datasets where the mutant library has a limited mutational spectrum.The left panel shows a footprint where mutations from A to G, G to A, T to C, and C to T are allowed.The right panel shows a footprint where only mutations from G to A and from C to T are allowed.

Fig 5 .
Fig 5. Noise as a function of library size.(A) Signal-to-noise ratio increases as library size increases.Signal-to-noise ratio is calculated according to Eqn 16. (B) Representative information footprints with a library size of 100, 500, and 1000.

Fig 8 .
Fig 8. Changing repressor copy number for a double-repression promoter.(A) States-and-weights diagram of a promoter with the double repression regulatory architecture.The bottom two states are only present under AND logic and not present under OR logic.The states-and-weights diagram of a double repression promoter with OR logic is also shown in Fig S2(E).(B) Changing the copy number of the first repressor under AND logic and OR logic affects the signal at both repressor binding sites.For the energy matrices of the repressors, the interaction energy between the repressor and a site is set to 0 k B T if the site has the wild-type base identity and set to 1 k B T if the site has the mutant base identity.The interaction energy between the repressors is set to 5 k B T .All data points are the mean of average mutual information across 20 synthetic datasets with the corresponding parameters.(C) Representative information footprints of a double repression promoter under AND and OR logic.

Fig 9 (
Fig 9(B), it is convenient to use chemical potential because in contrast to an isolated system, the resulting model no longer imposes a constraint on the exact number of RNAP or transcription factor bound to the

Fig 10 .
Fig 10.Changing the concentration of the inducer.(A) States-and-weights diagram for an inducible repressor.(B) States-and-weights diagram to calculate the probability that the repressor is in the active state.(C) Average mutual information at the repressor binding site decreases as the inducer concentration increases.Here, we let K A = 139 ⇥ 10 6 M, K I = 0.53 ⇥ 10 6 M, and " AI = 4.5 k B T .The thermodynamic parameters were inferred by Razo-Mejia et al. [34] from predicted IPTG induction curves.The inducer concentration on the x-axis is normalized with respect to the value of K A .(D) Representative information footprints with low inducer concentration (10 6 M) and high inducer concentration (10 3 M).

1 ) 2 )Fig 12 .
Fig 12.  Adding extrinsic noise to synthetic datasets.(A) The presence of extrinsic noise lowers the signal-to-noise ratio in information footprints.(B) Representative information footprints with no extrinsic noise and high extrinsic noise, where the standard deviations in the log-normal distributions of the RNAP and repressor copy numbers are 0.5 times the mean.

Fig 13 .
Fig 13.Spurious RNAP binding is common and may reduce signal-to-noise ratio.(A) States-and-weights diagram of a simple repression promoter where spurious RNAP binding is allowed.For each of the RNAP spurious binding events, the binding energy, " pd,i , is computed by mapping the RNAP energy matrix to the spurious binding site sequence.The index i correspond to the position of the first base pair to which RNAP binds along the promoter.0 is at the start of the promoter sequence; k is at the canonical RNAP binding site; n = 160 l p is index of the most downstream binding site where the promoter is assumed to be 160 base-pair long and l p is the length of the RNAP binding site.(B) Information footprints of a promoter under the simple repression regulatory architecture without (top) and with (bottom) spurious RNAP binding.
repressor binds between the -10 and the -35 sites.As shown inFig 14(B), the signals from repressor and RNAP binding are merged.Compared to the footprint where the binding sites do not overlap, it is harder to delineate the binding sites for either the repressor or RNAP at a base-by-base resolution.However, the observation that both positive and negative signals are present can lead to the hypothesis that the RNAP and the repressor bind to the same site.On the other hand, in the second synthetic dataset, the repressor blocks RNAP by binding to the -10 site.As shown in Fig14(C), the signal from repressor binding dominates the signal from RNAP binding.Without prior knowledge that RNAP binds at this position, such a footprint could lead to the erroneous conclusion that only repressors bind to this site.These analyses demonstrate the challenge of deciphering regulatory architectures in the presence of overlapping binding sites.This may be overcome by tuning growth conditions to reduce binding by some of the overlapping binding partners, such that we can obtain cleaner footprints with signal indicating individual binding events.

Fig 14 .
Fig 14.The presence of overlapping binding sites obscures signal in the information footprint.(A) Information footprint with no overlapping binding sites.(B) Information footprint where the repressor binds between the -10 and -35 RNAP binding sites.(C) Information footprint where the repressor binds to the -10 site where RNAP also binds.For consistency, in each of the plots, the repressor energy matrix is constructed based on the designated binding site sequence, where the binding energy at the wild-type base identity is set to 0 k B T and the binding energy at all other base identities is set to 1 k B T .
10 6.Given these values, we can estimate that " pd ⌧ 1, we can neglect this term from the denominator in Eqn 1 and simplify p bound for the simple repression motif to we can see that the MarR binding site overlaps with the RNAP binding site.Here, we build synthetic datasets where the RNAP